’’

’’

ACKNOWLEDGEMENTS: This work would not have been possible without the patience and insight of Ben Stewart (GOST). shittyPandasOpen() might be his finest contribution to the Bank yet.

The purpose of this guide is to describe how the script ‘RoadlabPro_Doctor’ works, which can be found here: https://github.com/worldbank/GOST_PublicGoods/tree/master/RoadLabPro . This guide is designed to be read with the script open for reference. Click this link to access the script before continuing.

Section 1: Setup

Here, we set the location of the data as the variable data_location. This should be a string-type filepath to a single folder, containing beneath it any file structure. This is a key variable to adjust if trying to reproduce the results locally.

Section 2: Diagnosis: Zipped Data

We use the glob2 library to iterate through the entire folder structure of datalocation, in UNIX fashion. It identifies in data_location all of the sub files and sub folders.

The second cell of Section 2 defines a function “PathDrDiagnose”. This function takes two inputs:

The function will return a count of the zipped folder paths, .csv files, .kml files and .rar folders to give the user a sense of the size of the data they are working with.

Section 3: Unzipping

In order to work with Roadlab Pro data, it needs to be fully unzipped. For this, we define a function PathDrUnzip. This function works through a given folder path, unzipping and extracting files as it finds them. It is recursive in nature - if it identifies a zipped folder inside a zipped folder, it will extract the contents of that nested-zip as well.

This function takes several arguments:

This function works by having a simple unzip function nested within the parent function, which it calls when appropriate. This function also takes care of the deletion of the unzipped folder once its contents have been extracted. The simple unzip looks like this:

def unzip(pth, in_path, unzip_path, delete = True):
    error_counter = 0
    subzip = []
    
    try:
        with zipfile.ZipFile(pth) as z:
            dirname = os.path.splitext(pth)[0]
            if os.path.exists(dirname):
                pass
            else:
                if collapse == True:
                    middle = os.path.relpath(pth, in_path)
                    middle = middle.split('\\')[:collapseDepth]
                    loc = os.path.join(unzip_path, *middle[:-1])
                else:
                    loc = os.path.join(unzip_path, os.path.splitext(os.path.relpath(pth, in_path))[0])
                if os.path.exists(loc):
                    print('passing - %s exists' % loc)
                    pass
                else:
                    os.mkdir(loc)
                for i in z.namelist():
                    if i[-4:] == '.zip':
                        print('adding subzip: ',os.path.join(loc, i))
                        subzip.append(os.path.join(loc, i))
                    print(i)
                    z.extract(i, loc)

        if delete == True: 
            os.unlink(pth)
    
    except:
        print('ERROR: cannot unzip %s' % pth)
        error_counter += 1
     
    return subzip, error_counter

The recurisve nature is generated by:

for i in z.namelist():
    if i[-4:] == '.zip':
        print('adding subzip: ',os.path.join(loc, i))
        subzip.append(os.path.join(loc, i))
        

Here, the function is adding any of the paths contained within the .zip file that terminate in ‘.zip’ to a list object (subzips). This is then returned at the end of the function. The actual extraction occurs within the function < (z.extract(i, loc)) >, so this keeps the return limited to just errors and nested .zip paths.

This is useful as the top level function is iterating through a list object called ‘remnants’, to which the ‘subzips’ list is appended constantly. In effect, remnants expands and contracts throughout the loop:

for zippy in zipped_paths:
      
      print('unzipping %s' % zippy)
      remnants, new_errors = unzip(zippy, in_path, unzip_path, delete)    # top level unzip occurs here
      error_counter += new_errors
      
      while len(remnants) > 0:        # continue this loop until no more nested .zip files found
          
          for sub_zippy in remnants:    # work through each .zip file path in remnants
              
              print('unzipping sub-zip: %s' % sub_zippy)
              res, new_errors = unzip(sub_zippy, in_path, unzip_path, delete)
              remnants.remove(sub_zippy)
              error_counter += new_errors
              
              if len(res) > 0:
                  
                  print('further subfolders found')
                  # here we add the contents of res (nested zip folders!) to remnants
                  remnants.append(*res)    
                  

In this way, we can be sure to unpack all files completely.

Section 4: Execute Unzipping

We confirm this assertion in this section, as we call PathDrDiagnose either side of PathDrUnzip. The PathDrDiagnose() function returns the list of .zip folders found, which means we can be sure we have hit them all.

print('pre unzip:')
PathDrDiagnose(test_path)

This should a text print that looks something like this:

’’

’’

This line calls the unzipper itself:

PathDrUnzip(test_path, test_path, delete = True, collapse=True, collapseDepth=2)

Finally, we check our results with another call to PathDrDiagnose:

print('\npost unzip:')
PathDrDiagnose(test_path)

Which should show something similar to:

’’

’’

N.B. Due to the fact there were only 2 .rar folders, these were unzipped by hand. PathDrUnzip currently does not support .rar decompression.

Section 5: Diagnosis: Unzipped Data

With our folder structure adjusted and .zipped folders unzipped, we can now begin to work with the underlying data.

The files encountered in April 2019 included traces of two broad types:

Trace type folder contents sub_folder contents
New

sub_folder

RoadSummary.csv

Bumps.csv

Bumps.kml

Path.csv

Path.kml

Roughness.csv

Roughness.kml

Old sub_folder

Bumps.csv

Bumps.kml

RoadPath.csv

RoadPath.kml

Intervals.csv

Intervals.kml

Simplified file names are provided above by way of illustration; actual traces often contain supplementary data of different formats eitherside of these key words, including the type of phone that collected the data, the date, and sometimes the ID of the road.

The Old and New traces looks similar but are subtely different.

The “Path” and “RoadPath” files both serve the same function - to record the location and timestamp of the phone whilst RoadLabPro is running, and nothing more. This is the canonical definitions of the geometry travelled for that trace, and should always form the basis of the visualized geometry.

The “Interval” and “Roughness” files are similar in that they both contain the roughness information recorded for the roads - in terms of a roughness proxy for the International Roughness Index (IRI), and a categorical descriptor of the surface quality.

However, the contents of these files are quite different, with different column header names, and separate approaches to ‘no data values’. In New style traces, a ‘Not Measured’ field is included in the Roughness.csv files when the vehicle is travelling below the required speed for an accurate measurement; in Old style traces, no interval is recorded at all. As such, the processing strategy for each type of trace needs to be different.

Further, the New traces often include a RoadSummary.csv file, which includes pre-calculated aggregate statistics for the road trace. However, there can be a many-to-one relationship between RoadSummary files and Path files, especially if the users created ‘links’ in their projects (sub-categorizations of a given VPROMMS), which prevents us from simply joining the RoadSummary to the Path file.

As such, the processing strategy adopted was:

Therefore, the next step was to identify all of the file paths for the .csv files that corresponded to these naming conventions:

road_summaries, intervals, roadpaths, roughness, paths = [], [], [], [], []

path = data_location
all_paths = glob2.glob(path+r'\**')
for z in all_paths:
    if 'RoadSummary' in z and '.csv' in z:
        road_summaries.append(z)
    elif '_Roughness_' in z and '.csv' in z:
        roughness.append(z)
    elif '_Path_' in z and '.csv' in z:
        paths.append(z)
    elif 'Intervals_' in z and '.csv' in z:
        intervals.append(z)
    elif 'RoadPath_' in z and '.csv' in z:
        roadpaths.append(z)

We organize this data by creating an object called a “trace” object. This is the basic building block of the processing strategy, a class we will define shortly. In order to construct a valid trace object, there must be both a ‘Path’ style file and a ‘Roughness’ style file.

Given this condition, we iterate through only one of these lists (‘roughness’) and create a new list of the directories of each file (‘new_style_RLP_roots’), and then iterate through these directories, aiming to identify a ‘path’ style file in the same folder. These path / roughness filepath pairs are then added to a master list of valid inputs for a trace, ‘valid_new_traces’:

# New Style Traces
new_style_RLP_roots = []
for fil in roughness:
    root = os.path.join(*fil.split('\\')[:-1])
    subs = glob2.glob(root.replace('C:Users',r'C:\Users')+r'\**')
    counter = 0
    for sub in subs:
        if '_Roughness_' in sub and '.csv' in sub:
            counter = 1
    if counter == 1:
        new_style_RLP_roots.append(root)

valid_new_traces = []
for new in new_style_RLP_roots:
    subs = glob2.glob(new.replace('C:Users',r'C:\Users')+r'\**')
    intervals_counter, roadpaths_counter = 0, 0
    for sub in subs:
        if '_Roughness_' in sub and '.csv' in sub:
            interval_fil = sub
            intervals_counter = 1
        elif '_Path_' in sub and '.csv' in sub:
            roadpath_fil = sub
            roadpaths_counter = 1
    if intervals_counter == 1 and roadpaths_counter == 1:
        valid_new_traces.append([roadpath_fil, interval_fil])

The code block thereafter deploys the same logic but for the old-style traces.

Section 6: Defining Class Objects

The Processing section provides the two main class objects which are used to homogenize the input data. These are NewTrace and OldTrace, and for the most part they offer very similar functionality. The components of the class will be broken down in order.

init(self, path_list)

As previously mentioned, the class object can only be instantiated if passed two filepaths - the first to the path / ‘geometry’ file, and the second to a file describing the roughness measurements made for the road. These are set as the first two attributes of the trace, self.path and self.roughness.

shittyPandasOpen(self, fileName)

Due to the nesting within the original data’s file structure, and the presence of Vietnamese characters in the filename, pandas’ read_csv() function does not work with all of the files we have to work with. As such, we designed and implemented a substitute function which aims to mirror the functionality of the pandas read_csv(). The use of:

fileName = r"\\?\%s" % fileName

…allows us to skirt the 256 character limit for filepaths on MS Windows.

load(self)

The load function attempts to open self.path and self.roughness filepaths using the shittyPandasOpen() function defined in the Class (either NewTrace or OldTrace classes). It also attempts to standardize some of the formatting of the column headers, e.g. by forcing them to upper case:

self.path_df.columns = [x.upper() for x in self.path_df.columns]

removing spurious errors (’Latidude –> Latitude):

if self.path_df.columns[3] == 'POINT_LATIDUDE':
        inColumns = list(self.path_df.columns)
        inColumns[3] = 'POINT_LATITUDE'
        self.path_df.columns = inColumns
        

…and creating shapely geometry objects from the (longitude, latitude) coordinate pair:

self.path_df['Points'] = list(zip(self.path_df.POINT_LONGITUDE.astype(float), 
                                          self.path_df.POINT_LATITUDE.astype(float)))
        self.path_df['Points'] = self.path_df['Points'].apply(Point)
    

Error handling is also introduced to try and characterize why this process might fail. Common errors include:

  • file exists in the file structure, but is effectively blank and contains no rows:

    if len(self.path_df) < 2:
          raise ValueError('Error: Path DataFrame contains one or fewer entries - cannot generate a line!')
  • corruption - some fields may be missing, e.g. ROAD_IDENTIFICATION, which is needed to specify the VPROMMS ID for new-style traces.

    try:
        if self.path_df.ROAD_IDENTIFICATION.iloc[0] == self.roughness_df.ROAD_IDENTIFICATION.iloc[0]:
            self.road_ID = self.path_df.ROAD_IDENTIFICATION.iloc[0]
        else:
            raise ValueError('Error: Path and Roughness files return different IDs!!')
    except:
        raise ValueError('Error: no "ROAD_IDENTIFICATION" field in DF - file corrupted')

The end result of the load function is the definition of three new class attributes:

  • self.roughness_df - a pandas DataFrame object of the self.roughness file path

  • self.path_df - a pandas DataFrame object of the raw GPS traces

  • self.path_geometry - this is a simple LineString of the list of points contained in self.path_df.Points. It attempts to make the LineString object as long as the number of rows in self.path_df is more than 2, as you need at least two points to make a line:

    if len(self.path_df) < 2:
                raise ValueError('Error: Path DataFrame contains one or fewer entries - cannot generate a line!')
                self.path_geometry = None
            else:
                self.path_geometry = LineString(list(self.path_df.Points))

checkGeometry(self, source = ‘epsg:4326’, target = ‘epsg:3405’, thresh = 300, min_coords = 4, min_length = 200)

RoadLab Pro traces come unfiltered and uncleaned for GPS errors. This presents a problem - as the time taken for the GPS location to ‘stabilize’ results in the first few coordinate (sometimes) being many meters from the real position - especially if this is the first recorded trace of the day on that phone. An example trace is visualized below:

’’

’’

The purpose of this function is to take the input geometry, self.path_geometry, and then analyze each link in the chain of points to detect pairs of points which are far away from one another. It flags ‘breaks’ in the line according to this logic, using the thresh parameter for definition of ‘too far away’. The user should provide the source and target epsg codes for the analysis in the format demonstrated; this allows for the reprojection of the input geometry, which natively comes in WGS 84 (or epsg:4326), to the UTM zone of operation (the target epsg). As this analysis was designed for Vietnam, the default settings are tuned for this case. The projection is provided by the partial and pyproj libraries. These operators are defined as so:

project_WGS_UTM = partial(
                pyproj.transform,
                pyproj.Proj(init=source),
                pyproj.Proj(init=target))

project_UTM_WGS = partial(
                pyproj.transform,
                pyproj.Proj(init=target),
                pyproj.Proj(init=source))

…then a little later these operators are ‘called’ to operate on a shapely object using ‘transform’:

proj_seg = transform(project_WGS_UTM, segment)

If a break (or several breaks) are detected, the LineString is split into several smaller geometry objects. in order to do this, we construct the most simple line possible between each pair of point coordinates on the line, and measure its length.

  for i in range(1, (len(coord_list))):
      st_coord = coord_list[i-1]
      end_coord = coord_list[i]
      segment = LineString([Point(st_coord),Point(end_coord)])
      proj_seg = transform(project_WGS_UTM, segment)
    
      if proj_seg.length > thresh:
          breaks.append(i)

Each of these new objects could be as short as a single point (if a break was detected either side of this point in the line) or very long. Hence, these geometries must be individually checked for validity. Validity is appraised against two criteria:

  • Number of individual points in the new geometry object. The corresponding parameter for this is min_coords. Unless there are at least ‘min_coords’ in the coordinate string, the new geometry will be determined to be invalid, and dropped from the analysis.

    if len(breaks) == 0:
        pass
    
    else:
        new_geoms = []
        breaks.append(len(coord_list))
        st = 0
        for i in breaks:
            coord_seq = coord_list[st:i]
            if len(coord_seq) > min_coords: 
                new_feature = LineString(coord_seq)
                new_feature_UTM = transform(project_WGS_UTM, new_feature)
                if new_feature_UTM.length > min_length:
                    new_geoms.append(LineString(coord_seq))
            st = i
  • Length (m) of the new object (obviously this will be 0 if it is a single point!). If the geometry is shorter than ‘min_length’, it is also considered invalid, and dropped. This is the ‘if new_feature_UTM.length > min_length:’ line in the above code block.

After this process has completed, for each trace, we can find ourselves in three separate scenarios:

  • There is only one valid geometry. In this case, self.new_geometry is set to this single LineString. This is equivalent to ‘lopping off’ any bad points that were collected:

    if len(new_geoms) == 1:
        self.new_geometry = new_geoms[0]
  • There are multiple valid geometries: this occurs if a user collected a segment of road correctly, then collected another segment in the same run somewhere else. This is likely invalid data, but there is no way of telling which of the two traces is correct. Hence, the workaround is to generate a MultiLineString object. This is a shapely object type that can include more than one LineString object, which do not necessarily have to touch. Think of this like a ‘broken line’. self.new_geometry is set to the MultiLineString:

    elif len(new_geoms) > 1:
        self.new_geometry = MultiLineString(new_geoms)
  • There are no valid geometries: at this point, these traces are dropped as we cannot determine a valid geometry for them. This would happen if all points in the coordinate sequence were at least (thresh) metres apart.

    else:
        print('invalid geometry! (%s) geoms' % len(new_geoms),self.road_ID,self.path)
        self.new_geometry = None

conditionSummary(self, mapper)

The purpose of this function is to generate the summary info about the trace that we want to take forward to the final output file.

Three important attributes of our trace object are set within this function:

  • self.condition: the RoadLab Pro app makes an estimate of the overall surface condition, and assigns it a qualitative rating. It does this for each interval it measures. Values that the condition can take include the keys in the following dictionary:

     mapper = {'NOT MEASURED':None,
             'VERY POOR':0,
             'POOR':1,
             'FAIR':2.0,
             'GOOD':3.0,
             'VERY GOOD':4.0,
             'EXCELLENT':5.0}

We pass this fixed dictionary into the function as the ‘mapper’ argument during the execution (Section (7))

In order to generate an average for a whole road segment, we first map these qualitative values to numerical values:

self.roughness_df.CATEGORY = self.roughness_df.CATEGORY.map(mapper)

…then take the distance-weighted average:

total_interval_length = self.roughness_df.DISTANCE.sum()
self.roughness_df['WEIGHTED_CAT'] = self.roughness_df.DISTANCE * self.roughness_df.CATEGORY
average_state = self.roughness_df.WEIGHTED_CAT.sum() / total_interval_length

and assign the rounded result to a new attribute, self.condition:

self.condition = (round(average_state))
  • self.IRI_average: we perform a very similar process to the IRI scores, with the exception of converting them to a numerical value, as these are already recorded in numerical form:

    self.roughness_df.IRI = self.roughness_df.IRI.astype(float)
    self.roughness_df['WEIGHTED_IRI'] = self.roughness_df.DISTANCE * self.roughness_df.IRI
    average_state = self.roughness_df.WEIGHTED_IRI.sum() / total_interval_length
    self.IRI_average = (average_state)
  • self.suspension - the suspension value remains constant for the entire trace, and serves as a record of the phone’s settings when gathering the data. We set the property to the first instance:

    self.suspension = tr.roughness_df.SUSPENSION.iloc[0]

Special: New vs. Old traces

Until now, the processing of Old traces and New traces has been largely identical; this can be confirmed by looking at the construction of the NewTrace() and OldTrace() class definitions.

However, an additional step is required for older traces. Specifically: New traces include a field in the files themselves labelled ‘Road_ID’. This makes associating data to a VPROMMS ID trivially easy. To extract the VPROMMS ID of the trace and link it to the trace object, we just look for the value in the first row of the path dataframe, as long as it is the same as that in the interval dataframe:

try:
    if self.path_df.ROAD_IDENTIFICATION.iloc[0] == self.roughness_df.ROAD_IDENTIFICATION.iloc[0]:
    self.road_ID = self.path_df.ROAD_IDENTIFICATION.iloc[0]

Old traces on the other hand do not have this feature. Often, the VPROMMS ID is contained in the filepath - the folder may be labelled with the VPROMMS ID, not even the filename itself. Hence, an additional step we take for Old traces is to run self.searchPathforVPROMMS(). This uses regex, a python library, to search the filepath for a pattern.

def searchPathforVPROMMS(self):
    search_res = re.search(r"[1-9][1-4][1-4].*\\", self.path)

We know from our work in Vietnam that VPROMMS IDs have a certain format. All such IDs begin with 3 consecutive numbers, the second two of which are usually smaller numbers between 1 and 4. We can search one of the two filepaths of the trace for expressions which meet that criteria, and then split the filepath on the presence of backslashes to isolate the term we want. We can then set this to the self.road_ID property. Otherwise, if we can’t find a suitable VPROMMS in the path, we simply set the road ID to the filepath, as we can do nothing more in this circumstance:

if search_res == None:
    base_pth = r'C:\\Users\\charl\\Documents\\GOST\\Vietnam\\unzipped'
    self.road_ID = os.path.relpath(self.path, base_pth) 
else:
    self.road_ID = search_res.group().split('\\')[0]

Section 7: Execution

In this section, we iterate through our prepared lists of new and old traces, and generate traces objects acoording to the NewTrace and OldTrace class definitions defined in Section 6.

The created trace objects get added to two lists: processed_new_traces and processed_old_traces accordingly.

A few helpful variables are added here for tracking progress:

This is also the area to define overwrite for the default values for the thresh, mincoords and minlength variables as used in the checkGeometry() method of the OldTrace and NewTrace classes.

As the processes are slightly different for Old and New traces, the majority of this code block is error handling. The practical processing is entirely contained within:

for trace_candidate in valid_new_traces:
    try:
        trace = NewTrace(trace_candidate)
        trace.load()
        trace.checkGeometry(thresh = thresh, min_coords = min_coords, min_length = min_length)
        trace.conditionSummary(mapper)
        processed_new_traces.append(trace)

much of the above should be familiar from the explanation in Section 6 - we are simply calling the methods of the trace class in order, then bundling them into a list object to deal with later. Similarly for Old traces:

for trace_candidate in valid_old_traces:
    try:
        trace = OldTrace(trace_candidate)
        trace.load()
        trace.checkGeometry(thresh = thresh, min_coords = min_coords, min_length = min_length)
        trace.searchPathforVPROMMS()   # Additional Old-only step
        trace.conditionSummary(mapper)
        processed_old_traces.append(trace)
    

Finally, we collect the errors that have been occuring along the way, and bundle those into a DataFrame:

processing_errors = pd.DataFrame(error_paths, columns = ['FilePath','Error'])
processing_errors['road_ID'] = 'trace never formed - cannot determine road_ID'

As these errors occured during the processing phase, we do not know the road ID - as the trace object was never successfully formed. Hence, we add information in the ‘road_ID’ column for all errors collected up to this point.

If running successfully, the script will print out the errors it encounters, as well as summary stats for the number of traces of each type that it generated:

!(C:493355.png)

Section 8: Form DataFrame of Processed traces

This section iterates through the processed traces to form the rows of the final output DataFrame. We want each trace object to be represented by a single row in the final DataFrame. The trace instances are very useful for keeping all of the information ordered and together. As such, forming the final DataFrame is straightforward:

rows = []
for trace in processed_new_traces:
    rows.append({
        'province':os.path.relpath(trace.path, base_pth).split('\\')[0],
        'road_ID':trace.road_ID,
        'geometry':trace.new_geometry,
        'condition':trace.condition,
        'IRI_average':trace.IRI_average, 
        'suspension':trace.suspension,
        'trace_type':'new_trace',
        'FilePath':os.path.relpath(trace.path, base_pth)
    })
for trace in processed_old_traces:
    rows.append({
        'province':os.path.relpath(trace.path, base_pth).split('\\')[0],
        'road_ID':trace.road_ID,
        'geometry':trace.new_geometry,
        'condition':trace.condition,
        'IRI_average':trace.IRI_average, 
        'suspension':trace.suspension,
        'trace_type':'old_trace',
        'FilePath':os.path.relpath(trace.path, base_pth)
    })
df = pd.DataFrame(rows)

We also define a new ‘geom_class’ column, which records the geometry type of each line. As mentioned above, it is likely that entries with a geometry type of MultiLineString represent poor quality data; as such, we provide the means of making that distinction in the DataFrame.

df = pd.DataFrame(rows)
df['geom_class'] = df['geometry'].apply(type).astype(str)
df['geom_class'] = df['geom_class'].map({r"<class 'shapely.geometry.linestring.LineString'>":'Line',
                     r"<class 'shapely.geometry.multilinestring.MultiLineString'>":'MultiLine',
                    r"<class 'NoneType'>":'NoneType'})

Finally, we move any rows with a geometry type of ‘None’ to the errors DataFrame - these are the traces where the cleaning process determined that there were no valid geometries:

null_geom_entires = df[['FilePath','road_ID']].loc[df.geom_class == 'NoneType']
null_geom_entires['Error'] = 'Error: Invalid geometry - no points within %s meters of eachother' % thresh
errors = pd.concat([processing_errors,null_geom_entires], axis = 0)

Section 10: Save Down

At long last, we are in a position to save down our three output files:

Troubleshooting

For help, email charlesjefox@btinternet.com or GOST@worldbank.org